Is the Multigrid Method Fault Tolerant? The Multilevel Case
Authors
Abstract
Computing at the exascale level is expected to be affected by a significantly higher rate of faults, due to increased component counts as well as power considerations. Therefore, current-day numerical algorithms need to be reexamined to determine whether they are fault resilient, and which critical operations need to be safeguarded in order to obtain performance that is close to that of the ideal fault-free method. In a previous paper [1], a framework for the analysis of random stationary linear iterations was presented and applied to the two-grid method. The present work is concerned with the multigrid algorithm for the solution of linear systems of equations, which is widely used on high performance computing systems. It is shown that the Fault-Prone Multigrid Method is not resilient unless the prolongation operation is protected. Strategies for fault detection and mitigation, as well as for protection of the prolongation operation, are presented and tested, and a guideline for an optimal choice of parameters is devised.
Similar resources
Asynchronous parallel solvers for linear systems arising in computational engineering
Modern trends in Computational Science and Engineering are moving towards the use of computer systems with ever increasing numbers of computational cores. A consequence of this is that over the next decade it will be necessary to develop and apply new numerical algorithms that are far more scalable than has historically been required. Ideally, such algorithms will be able to exploit many thousa...
Fault Diagnosis and Fault-Tolerant SVPWM Technique of Six-phase Converter under Open-Switch Fault
In this paper, a new open-switch fault diagnosis method is proposed for the six-phase AC-DC converter based on the difference between the phase current and the corresponding reference using an adaptive threshold. The open-switch faults are detected without any additional equipment and complicated calculations, since the proposed fault detection method is integrated with the controller required ...
Fault tolerant system with imperfect coverage, reboot and server vacation
This study is concerned with the performance modeling of a fault tolerant system consisting of operating units supported by a combination of warm and cold spares. The on-line as well as warm standby units are subject to failures and are sent for repair to a repair facility with a single repairman who is prone to failure. If the failed unit is not detected, the system enters into an unsafe...
Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing
We examine novel fault tolerance schemes for data loss in multigrid solvers which essentially combine ideas of checkpoint-restart with algorithm-based fault tolerance. To improve efficiency compared to conventional global checkpointing, we exploit the inherent data compression of the multigrid hierarchy, and relax the synchronicity requirement through a local failure local recovery approach. We...
Is the Multigrid Method Fault Tolerant? The Two-Grid Case
The predicted reduced resiliency of next-generation high performance computers means that it will become necessary to take into account the effects of randomly occurring faults on numerical methods. Further, in the event of a hard fault occurring, a decision has to be made as to what remedial action should be taken in order to resume the execution of the algorithm. The action that is chosen can...
Journal:
- SIAM J. Scientific Computing
Volume 39, Issue
Pages -
Publication date: 2017